URL as starting point for WWW document categorization

Authors

  • Vojtech Svátek
  • Petr Berka
Abstract

Information about the category (type) of a WWW page can help the user in search, filtering, and navigation tasks. We propose a multidimensional categorisation scheme with the bibliographic dimension as the primary one. We examine the possibilities and limits of performing such categorisation based on information extracted from the URL, which is particularly useful for certain on-line applications such as meta-search or navigation support. In addition, we describe the problem of ambiguity of URL terms and suggest a method for partially overcoming it by means of machine learning. As a side effect, we show that general-purpose WWW search engines can be used to provide input data for both human and computational analysis of the web.
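To make the idea of URL-based categorisation concrete, the following is a minimal sketch, not the authors' method. It splits a URL into alphabetic terms and looks them up in a small hand-made table. The table `TERM_HINTS`, its category names, and the functions `url_terms` and `guess_categories` are all hypothetical, introduced here only for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical term-to-category hints; illustrative only, not taken from the paper.
TERM_HINTS = {
    "faq": "FAQ",
    "staff": "personal page",
    "people": "personal page",
    "publications": "publication list",
    "papers": "publication list",
    "products": "product catalogue",
}


def url_terms(url):
    """Split the host, path, and query of a URL into lowercase alphabetic terms."""
    parts = urlparse(url)
    text = " ".join([parts.netloc, parts.path, parts.query])
    return [term.lower() for term in re.findall(r"[A-Za-z]+", text)]


def guess_categories(url):
    """Return the sorted category hints whose trigger term occurs in the URL."""
    return sorted({TERM_HINTS[t] for t in url_terms(url) if t in TERM_HINTS})


if __name__ == "__main__":
    print(guess_categories("http://www.example.org/staff/papers/index.html"))
```

A fixed lookup like this cannot tell apart different uses of the same URL term, which is the ambiguity problem the abstract mentions and proposes to address partially with machine learning.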


Similar resources

Weight Adjustment Schemes for a Centroid Based Classifier

In recent years we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intra-nets. Automatic text categorization, which is the task of assigning text documents to pre-specified classes (topics or themes) of documents, is an important task that can help both in organizing as well as in finding information on t...



Web-Specific Genre Visualization

User interfaces to WWW search engines typically present results as ranked lists of documents. Such lists give users little help in understanding document variation: we propose a richer representation of retrieval results in the search interface. Fundamental to us is the notion of document grouping. We use both stylistic genre-based document categorization and statistical content-based clusterin...


A Domain Cluster Interface for WWW Search

Because of the recent explosive increase in the number of WWW documents, directory services are indispensable in finding needed documents. In the keyword search function of most directory services, search results are displayed as a URL list ordered by importance calculated by the system, but the order sometimes does not have any meaning to the user since the calculation algorithm is a black box...


Centroid-Based Document Classification: Analysis & Experimental Results

In recent years we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. Automatic text categorization, which is the task of assigning text documents to pre-specified classes of documents, is an important task that can help both in organizing as well as in finding information on these huge resources....




Publication date: 2000